Previously, we covered two main forms of cluster analysis (hierarchical and partitioning) and established a basic understanding of why cluster analysis can be such a useful technique for understanding and interpreting our data.
In this practical we will get some hands-on experience of running these two forms of cluster analysis in R.
20.2 Hierarchical clustering: demonstration
Step One: Loading the data
In this walk-through, we will use the dataset available for download using the following code:
Show code
df <- read.csv('https://www.dropbox.com/scl/fi/4ka6a7tbz03tdfnu5fb2f/cluster_data_01.csv?rlkey=i5gi1k9ydgxdit65ck98ak3tp&dl=1')
head(df, 10)   # display the first ten rows
summary(df)    # summary statistics for each variable
X AceRate DoubleFaultsRate WinRate
Min. : 1.00 Min. :-0.01421 Min. :1.345 Min. :42.10
1st Qu.:23.25 1st Qu.: 2.37744 1st Qu.:2.062 1st Qu.:62.96
Median :45.50 Median : 3.32134 Median :3.848 Median :72.42
Mean :45.50 Mean : 4.37620 Mean :3.692 Mean :70.04
3rd Qu.:67.75 3rd Qu.: 6.60527 3rd Qu.:4.952 3rd Qu.:78.08
Max. :90.00 Max. :11.57383 Max. :7.169 Max. :90.50
FirstServeAccuracy BreakPointsSavedRate
Min. :56.66 Min. :39.47
1st Qu.:66.34 1st Qu.:61.47
Median :74.46 Median :72.26
Mean :72.81 Mean :70.23
3rd Qu.:78.82 3rd Qu.:79.08
Max. :86.88 Max. :91.99
Show code
str(df)
'data.frame': 90 obs. of 6 variables:
$ X : int 53 24 68 67 11 74 5 32 86 85 ...
$ AceRate : num 2.9659 6.5422 2.212 -0.0142 10.4482 ...
$ DoubleFaultsRate : num 1.87 6.37 3.8 2.64 4.31 ...
$ WinRate : num 81.1 76.4 50.5 63.8 65.1 ...
$ FirstServeAccuracy : num 80.1 64.7 79.2 71.1 61.4 ...
$ BreakPointsSavedRate: num 78 44.9 80.1 92 74.4 ...
This dataset contains 90 observations, each representing a tennis player. For each player, we have the following information:
AceRate: Average number of aces per match.
DoubleFaultsRate: Average number of double faults per match.
WinRate: Win rate in percentages.
FirstServeAccuracy: First serve accuracy in percentages.
BreakPointsSavedRate: Rate of break points saved in percentages.
Step Two: Computing the distance matrix
As noted previously, hierarchical clustering requires us to calculate a distance matrix.
There are a number of different measures of distance that can be used (Euclidean, Manhattan etc.).
In this walk-through, we’ll use Euclidean distance.
Show code
# calculate a distance matrix using Euclidean distance
dist_matrix <- dist(df, method = "euclidean")
What’s Euclidean distance?
Euclidean distance is used to measure the straight-line distance between two points in Euclidean space, which is the familiar two-dimensional or three-dimensional space we encounter in everyday geometry.
It’s calculated using the Pythagorean theorem, which involves squaring the differences between corresponding coordinates of the points and then taking the square root of the sum of these squares.
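To make this concrete, here is a minimal sketch using two made-up points (not part of the tennis dataset): the manual calculation and R’s dist() function give the same answer.
# two hypothetical points in three-dimensional space
p1 <- c(2, 4, 6)
p2 <- c(5, 8, 6)

# Euclidean distance 'by hand': square the differences, sum them, take the square root
sqrt(sum((p1 - p2)^2))                      # gives 5

# the same calculation using dist()
dist(rbind(p1, p2), method = "euclidean")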
Step Three: Visualise the distance matrix
It’s useful to visualise the distance matrix before continuing to further analysis.
Show code
# using the pheatmap package
library(pheatmap)
pheatmap(dist_matrix)
Show code
# Hierarchical clustering
hc <- hclust(dist_matrix)   # performs a basic hierarchical clustering

# Plotting the dendrogram
plot(hc, main = "Dendrogram of Hierarchical Clustering")
From the dendrogram, it looks like our data might form three clusters. We can check that further in the next stage.
Step Four: Running a hierarchical cluster analysis
We can now use the distance matrix to perform the clustering. There are different methods available to do this; in this walk-through, we’ll use ward.D.
Show code
hc <- hclust(dist_matrix, method = "ward.D")
# hc is now in our environment
Step Five: Cut the tree to form clusters
We can now determine the number of clusters, and ‘cut’ the dendrogram.
There’s a section below on how we can determine the best number of clusters to ‘cut’.
For enhanced visualisation, you can use the dendextend package.
Show code
library(dendextend)
Show code
dend <- as.dendrogram(hc)
dend <- color_branches(dend, k = 3)
plot(dend)
The dynamicTreeCut package is also useful:
Show code
# load the dynamicTreeCut package
library(dynamicTreeCut)

# Cut the dendrogram tree
clusters <- cutreeDynamic(dendro = hc, distM = as.matrix(dist_matrix), method = "hybrid")
..cutHeight not given, setting it to 1180 ===> 99% of the (truncated) height range in dendro.
..done.
Show code
# Plot the dendrogram with color-coded clusters
plot(hc, labels = FALSE, hang = -1, main = "Dendrogram with Dynamic Tree Cut")
rect.hclust(hc, k = length(unique(clusters)), border = "red")   # Add colored borders
Step Six: Assess the quality of the clustering
Silhouette plots provide a visual representation of how close each point in one cluster is to points in the neighboring clusters. This helps in assessing the quality of the clustering.
Show code
# load the cluster package
library(cluster)

# Compute silhouette information
sil <- silhouette(cutree(hc, k = 3), dist_matrix)

# Plotting the silhouette, with one colour per cluster
plot(sil, col = 1:max(sil[, 1]), border = NA, main = "Silhouette Plot")
Step Seven: Analysing the results
Examine the clusters formed.
You can attach the cluster labels back to your original data for further analysis.
Show code
# this command labels each observation depending on which cluster it belongs to
df$cluster <- cutree(hc, k = 3)
Now, when you view the dataset df, you will see that each player has been ‘attached’ to one of the three clusters.
This information could then be used to perform between-group analysis of difference (e.g. ANOVA) on specific measures.
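For instance, a one-way ANOVA (a minimal sketch, assuming the cluster column created above) could test whether mean win rate differs between the three clusters:
# does WinRate differ significantly between the three clusters?
anova_result <- aov(WinRate ~ factor(cluster), data = df)
summary(anova_result)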
Or you can use the cluster information to produce plots based on group membership.
Show code
# Example R code for scatterplot
df$cluster <- as.factor(df$cluster)   # make sure R knows that cluster is a factor (grouping variable)

library(ggplot2)
ggplot(df, aes(x = AceRate, y = WinRate, color = cluster)) +
  geom_point() +
  labs(title = "Scatterplot of AceRate vs WinRate by Cluster", x = "AceRate", y = "WinRate") +
  theme_minimal()
There are lots of interesting things you can do with this information.
Depending on your needs, you might perform additional analysis, like:
Checking the characteristics of each cluster.
Conducting statistical tests to understand if clusters differ significantly.
Using principal component analysis (PCA) to reduce dimensions for visualisation (a brief sketch follows).
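Here is a rough sketch of the PCA idea. It assumes the df and cluster objects created above, and uses the variable names from the dataset description:
# PCA on the five performance variables (dropping the X index and cluster columns)
pca <- prcomp(df[, c("AceRate", "DoubleFaultsRate", "WinRate",
                     "FirstServeAccuracy", "BreakPointsSavedRate")],
              scale. = TRUE)

# combine the first two components with the cluster labels
pca_df <- data.frame(pca$x[, 1:2], cluster = factor(df$cluster))

library(ggplot2)
ggplot(pca_df, aes(x = PC1, y = PC2, colour = cluster)) +
  geom_point() +
  labs(title = "Clusters projected onto the first two principal components") +
  theme_minimal()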
Determining the number of clusters
You might be wondering: why did we choose to identify three clusters? This is a complex topic, and slightly beyond the scope of this practical.
However, some common approaches include:
Dendrogram Analysis. One common method is to visually inspect the dendrogram produced by the hierarchical clustering. Look for the “natural” breaks in the dendrogram where there is a large jump in the distance at which clusters are merged.
Inconsistency Method. This method examines the inconsistencies in the distances at which clusters are merged in the dendrogram. A threshold inconsistency value can be used to decide the number of clusters.
Elbow Method. Although more commonly associated with k-means clustering, the elbow method can also be adapted for hierarchical clustering. It involves plotting the percentage of variance explained as a function of the number of clusters. The point where the marginal gain drops, creating an “elbow” in the graph, indicates the optimal number of clusters. There’s an example below.
Gap Statistic. The gap statistic compares the total intra-cluster variation for different numbers of clusters with the values expected under a null reference distribution of the data. The optimal number of clusters is the value that maximizes the gap statistic.
Silhouette Analysis. This method involves calculating the silhouette coefficient for different numbers of clusters. The silhouette coefficient measures how similar an object is to its own cluster compared to other clusters. The optimal number of clusters usually maximizes the average silhouette score (a short sketch appears at the end of this section).
Domain Knowledge. Often, the choice of the number of clusters is guided by domain knowledge or the specific requirements of the task at hand. For example, in market segmentation, the number of clusters might be determined based on the number of different marketing strategies the team can feasibly implement.
Stability Analysis. Stability-based methods involve comparing the results of cluster analysis on the original data with those on perturbed or bootstrapped samples. If a particular number of clusters consistently emerges across different samples, it may be considered the optimal number.
Statistical Criteria. Some statistical criteria like the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) can be used to compare models with different numbers of clusters. These criteria balance the goodness-of-fit of the model with the complexity of the model.
The final approach, statistical criteria, can be very helpful, as it allows you to compare different models.
There are some mathematical issues with using AIC and BIC for hierarchical clustering (it has no defined likelihood function), which makes them slightly less reliable here. They should be used alongside other approaches, such as visual inspection of the dendrogram.
In the following example, I’ve used AIC and BIC to evaluate models with 2 to 5 clusters, fitting a Gaussian mixture model as a proxy at each value of k. When reading the resulting plot, remember that lower AIC and BIC values indicate a better balance between fit and complexity; check whether the values support our earlier decision to use three clusters.
Show code
library(mclust)
Show code
library(cluster)

# Assuming 'hc' is your hierarchical clustering model
# and 'df' is your original data
aic_values <- numeric()
bic_values <- numeric()
max_k <- 5

for (k in 2:max_k) {  # max_k is the maximum number of clusters you want to consider
  # Cut the dendrogram to get k clusters
  clusters <- cutree(hc, k)
  # Fit a Gaussian mixture model (as a proxy) to the data for the given number of clusters
  gmm <- Mclust(df, G = k)
  # Store the AIC and BIC values
  aic_values[k] <- AIC(gmm)
  bic_values[k] <- BIC(gmm)
}

# Plot AIC and BIC values to find the optimal number of clusters
plot(2:max_k, aic_values[2:max_k], type = "b", col = "blue",
     xlab = "Number of clusters", ylab = "AIC")
points(2:max_k, bic_values[2:max_k], type = "b", col = "red")
legend("bottomright", legend = c("AIC", "BIC"), col = c("blue", "red"), lty = 1)
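As a complement to AIC and BIC, here is a minimal sketch of the silhouette approach mentioned above (it reuses the hc and dist_matrix objects created earlier). It computes the average silhouette width for two to five clusters; the value of k with the largest average width is usually preferred.
# average silhouette width for k = 2 to 5 clusters (hierarchical solution)
library(cluster)
avg_sil <- sapply(2:5, function(k) {
  mean(silhouette(cutree(hc, k = k), dist_matrix)[, "sil_width"])
})
names(avg_sil) <- 2:5
avg_sil   # the k with the largest value is usually preferred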
20.3 Hierarchical Clustering: Practice
Note
It’s time to put what you’ve learned into practice. There are two parts to this section. The Core Task is designed to replicate the steps outlined above, and give you confidence in undertaking a basic hierarchical cluster analysis of a dataset. The second is intended as an Extension Task, and is optional. This will take you further into the topic, but may require significant further reading.
Core task
Objective: The goal of this task is to familiarize you with the process of conducting a hierarchical cluster analysis using R and to gain confidence in preparing data, performing the analysis, and interpreting the results.
The dataset contains data for 200 golfers, with four attributes (variables):
DrivingAccuracy
PuttingAccuracy
AverageScore
TournamentsWon
Task Steps:
Data Preparation
Import the dataset.
Check the data for missing values and outliers.
Exploratory Data Analysis
Perform basic statistical analyses (mean, median, standard deviation).
Visualise the data using plots (scatter plots, box plots).
Distance Computation
Calculate the distance matrix using an appropriate distance measure (e.g., Euclidean).
Hierarchical Clustering
Perform hierarchical clustering using methods like Ward’s method or single linkage.
Generate a dendrogram to visualise the clustering.
Determining the Number of Clusters
Use methods like dendrogram inspection, statistical approaches like AIC/BIC, the elbow method, or silhouette analysis to determine the optimal number of clusters.
Cluster Analysis
Cut the dendrogram to form the identified number of clusters.
Analyse the characteristics of each cluster.
Interpretation and Reporting
Interpret the results in the context of the data.
Extension task (optional)
Objective:
This extension task aims to deepen your understanding of hierarchical cluster analysis by introducing advanced concepts and applications. You could explore different techniques and compare hierarchical clustering with other clustering methods.
Task Steps:
Advanced Clustering Techniques
Explore different linkage methods (complete, average, centroid) and compare their impact on the clustering results.
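One possible starting point (a sketch only, assuming you have already computed a distance matrix called dist_matrix for the golf data) is to fit alternative linkage methods and compare how faithfully each dendrogram preserves the original distances, using the cophenetic correlation:
# fit two alternative linkage methods
hc_complete <- hclust(dist_matrix, method = "complete")
hc_average  <- hclust(dist_matrix, method = "average")

# cophenetic correlation: higher values mean the dendrogram preserves the original distances better
cor(cophenetic(hc_complete), dist_matrix)
cor(cophenetic(hc_average), dist_matrix)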
Comparative Analysis
Perform k-means clustering on the same dataset.
Compare the results of k-means and hierarchical clustering. Discuss the pros and cons of each method.
Cluster Validation
Implement internal validation measures such as the Dunn index or Davies-Bouldin index to assess the quality of the clusters.
Dimensionality Reduction
Apply Principal Component Analysis (PCA) to reduce the dataset to two or three dimensions.
Perform hierarchical clustering on the reduced dataset and compare the results with the original high-dimensional clustering.
Statistical Analysis and Interpretation
Perform a statistical analysis to understand the significance of the clusters.
Provide a detailed interpretation of the clusters in the context of the complex dataset.
20.4 K-means Clustering: Demonstration
K-means clustering is an unsupervised learning technique used for partitioning a dataset into K distinct, non-overlapping clusters.
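In other words, k-means tries to minimise the total within-cluster sum of squares. The following minimal sketch (using a small simulated matrix rather than the dataset below) shows that the tot.withinss value reported by kmeans() is simply the sum of squared distances from each point to its assigned cluster centre:
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)        # 50 random two-dimensional points
km <- kmeans(x, centers = 3, nstart = 10)

# recompute the objective by hand: squared distance of each point to its assigned centroid
manual_wss <- sum((x - km$centers[km$cluster, ])^2)

manual_wss        # should match...
km$tot.withinss   # ...the value reported by kmeans()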
The following walk-through provides a step-by-step process for conducting a basic k-means cluster analysis.
Step One: Load packages and data
Along with the stats package that comes as part of base R, we need to load two packages, dplyr and ggplot2.
Show code
library(dplyr)
library(ggplot2)
For this walk-through, we’re going to create a dataset called athletes_data by downloading a remote file:
Show code
athletes_data <- read.csv('https://www.dropbox.com/scl/fi/f1czg6gk1mvhwmms5j4gz/k_means_01.csv?rlkey=jc4xak7rkgf67omdpri9ymygu&dl=1')
head(athletes_data, 10)   # display the first ten rows
Step Two: Prepare the data
For k-means, we first select just the performance variables we want to cluster on (this is done at the top of the code block below).
Step Three: Determine the optimal number of clusters
Now, we can use the Elbow Method to determine the optimal number of clusters.
The Elbow method involves running k-means clustering on the dataset for a range of values of k (number of clusters) and then plotting the total within-cluster sum of squares (WSS).
Show code
# Select only the relevant columns for clustering
clustering_data <- athletes_data[, c("speed", "endurance", "strength", "agility")]

# Determine max number of clusters you want to test
max_clusters <- 10

# Calculate total within-cluster sum of squares for each candidate k
wss <- sapply(1:max_clusters, function(k){
  kmeans(clustering_data, centers = k, nstart = 10)$tot.withinss
})

# Plot the Elbow Plot
plot(1:max_clusters, wss, type = "b",
     xlab = "Number of clusters", ylab = "Total within-cluster sum of squares")
To interpret this plot, find the “elbow” point in the plot. This point represents a balance between minimizing WSS and having a smaller number of clusters. It’s the point beyond which the reduction in WSS starts diminishing. In the plot above, this looks to be 4 clusters (the slope becomes much flatter after this).
Choosing the number of clusters often requires a bit of subjective judgment. In some datasets, the elbow might not be very pronounced, and in such cases, alternative methods like the silhouette method can be used.
It’s also important to remember that the Elbow method is heuristic and provides an approximate number of clusters. The actual number might vary depending on the specific context and requirements of the analysis.
Caution
The term “heuristic” refers to a practical approach to problem-solving, learning, or discovery that employs a method not guaranteed to be optimal or perfect, but sufficient for reaching an immediate, short-term goal or approximation.
Heuristics are often used in situations where an exhaustive search or precise solution is impractical or impossible due to time or resource constraints.
Step Four: Run the K-means algorithm
Now, using the determined number of clusters (let’s say 4 for this dataset) we can perform k-means clustering.
Show code
# Select only the relevant columns for clustering
clustering_data <- athletes_data[, c("speed", "endurance", "strength", "agility")]

# Perform k-means clustering with 4 clusters
set.seed(123)   # Set a random seed for reproducibility
kmeans_result <- kmeans(clustering_data, centers = 4, nstart = 20)

# View the results
print(kmeans_result)
# Adding cluster assignments to the original dataset
athletes_data$cluster <- kmeans_result$cluster

# make sure R knows that 'cluster' is a grouping variable, or factor
athletes_data$cluster <- as.factor(athletes_data$cluster)

# We can view the first few rows to see the cluster assignments
head(athletes_data)
We use the kmeans function from the stats package. The centers parameter is set to 4, as determined by the Elbow method.
The nstart parameter is set to 20. This means the k-means algorithm will be run 20 times with different random starts. The best solution in terms of within-cluster sum of squares is retained. This helps in achieving a more reliable outcome, as k-means clustering is sensitive to the initial starting points.
set.seed(123) is used for reproducibility. It ensures that the random numbers generated are the same each time the code is run, which is important for replicating the results.
After running the clustering, we add the cluster assignments back to the original dataset. This allows for further analysis and exploration of the clusters.
Finally, the results of the k-means clustering, including the centroids of each cluster and the size of each cluster, are printed out for review.
Step Five: Examine the results
We can inspect the results in several ways. First, look at the cluster centroids.
Check the size of each cluster. Note that cluster 1 seems to represent ‘strength’, while cluster 2 represents ‘endurance’.
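Both pieces of information are stored in the kmeans_result object created above, for example:
# cluster centroids (the mean of each variable within each cluster)
kmeans_result$centers

# number of athletes assigned to each cluster
kmeans_result$size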
You’ll notice that each athlete’s most ‘dominant’ cluster is now included in the dataset athletes_data. If we knew more about the athletes, we could now check whether our clusters are ‘accurate’ (for example, do the weightlifters tend to cluster around cluster 1?)
Step Six: Visualise the clusters
Create a plot to visualize the clusters, using ggplot2 for example. This plot shows speed against endurance for each of the four clusters. Notice that athletes in cluster 1 (red) are lower than the others on both variables.
Show code
# Create a scatter plot using ggplot
ggplot(athletes_data, aes(x = speed, y = endurance, color = factor(cluster))) +
  geom_point() +                                            # Add points
  labs(color = "Cluster", x = "Speed", y = "Endurance") +   # Labelling
  theme_minimal() +                                         # Minimal theme
  scale_color_brewer(palette = "Set1")                      # Use a color palette for better distinction
Step Seven: Validate our analysis
The use of four clusters seems reasonable in the context of our dataset. However, we might want some additional reassurance. One approach is to calculate the ‘silhouette coefficient’ of our clustering.
The silhouette coefficient is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
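More precisely, if a(i) is the average distance from observation i to the other points in its own cluster, and b(i) is the smallest average distance from i to the points of any other cluster, the silhouette width is s(i) = (b(i) - a(i)) / max(a(i), b(i)), which always lies between -1 and +1.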
We need to use the ‘cluster’ package to do this.
Show code
# Load necessary libraries
library(cluster)   # for silhouette
library(stats)     # for kmeans

# Calculating silhouette width
sil_width <- silhouette(kmeans_result$cluster, dist(clustering_data))

# Average silhouette width for the entire dataset
avg_sil_width <- mean(sil_width[, 'sil_width'])

# Print the average silhouette width
print(avg_sil_width)
[1] 0.462397
Show code
# Plot the silhouette plot
plot(sil_width, col = 1:max(kmeans_result$cluster), border = NA)
The silhouette function computes the silhouette widths for each observation in our dataset. It requires two arguments: the clustering vector (from your k-means result) and the distance matrix of the data.
The dist function is used to compute the distance matrix required by silhouette.
avg_sil_width calculates the average silhouette width for all observations, providing an overall measure of the clustering quality.
Finally, a plot is generated to visually inspect the silhouette widths of individual observations. This plot can help identify which observations may be poorly clustered.
A silhouette width close to +1 indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
If most objects have a high value, the clustering configuration is appropriate.
If many points have a low or negative value, the clustering configuration may have too many or too few clusters.
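As a quick follow-up check (reusing the sil_width object computed above), you can count and identify the athletes with negative silhouette widths:
# how many observations sit uncomfortably in their assigned cluster?
sum(sil_width[, "sil_width"] < 0)

# which observations are they?
which(sil_width[, "sil_width"] < 0)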
Further ideas for visualisation
One of the most useful things about cluster analysis is its ability to provide visual representations of data. Here are some examples that you might find useful:
Show code
library(ggplot2)
library(RColorBrewer)
library(gplots)

ggplot(athletes_data, aes(speed, endurance, color = factor(cluster))) +
  geom_point() +
  facet_wrap(~cluster)
20.5 K-means Clustering: Practice
Step One: Data preparation
Examine the data. Check for outliers or missing data. Try to understand the data.
Step Two: Data exploration
Calculate descriptive statistics (mean, median, standard deviation, kurtosis, skewness) for each variable.
Step Three: Identify number of clusters
Choose the number of clusters using the Elbow method to estimate the optimal number of clusters.
Apply any domain-specific knowledge to inform your choice of the number of clusters.
Step Four: Perform k-Means clustering
Perform k-means clustering.
Choose parameters like the number of clusters and the number of iterations.
Add the cluster identifier to each athlete’s data
Step Five: Analyse the results
Examine the centroids of each cluster.
Look at the distribution of data points in each cluster. Use visual approaches to help you.
Interpret the clusters in the context of the problem. This might involve understanding what each cluster represents.
Step Six: Validate the clusters
Use metrics like the silhouette coefficient to measure the quality of clustering.
Step Seven: Report the results
Consider how you would report the results of your analysis.
Show solution for practice
#-----------------------------
# Load and Explore the Data
#-----------------------------
df <- read.csv('https://www.dropbox.com/scl/fi/n0qhuenart33mt33cw1m6/k_means_02.csv?rlkey=2m7dtn3haor99eecosxzxh52s&dl=1')
head(df, 10)   # display the first ten rows
summary(df)
str(df)

#---------------------------
# Step Two: Data exploration
#---------------------------
library(psych)   # just one possible approach
describe(df)

#---------------------------
# Step Three: Determine optimal number of clusters
#---------------------------
clustering_data <- df[, c("endurance", "sprinting_ability", "climbing_ability")]

# Determine max number of clusters you want to test
max_clusters <- 10

# Calculate total within-cluster sum of squares
wss <- sapply(1:max_clusters, function(k){
  kmeans(clustering_data, centers = k, nstart = 10)$tot.withinss
})

# Plot the Elbow Plot
plot(1:max_clusters, wss, type = "b",
     xlab = "Number of clusters", ylab = "Total within-cluster sum of squares")

#---------------------------
# Step Four: Run the k-means algorithm
#---------------------------
# Perform k-means clustering with 3 clusters
set.seed(123)   # Set a random seed for reproducibility
kmeans_result <- kmeans(clustering_data, centers = 3, nstart = 20)

# View the results
print(kmeans_result)

# Adding cluster assignments to the original dataset
df$cluster <- kmeans_result$cluster

# We can view the first few rows to see the cluster assignments
head(df)

#---------------------------
# Step Five: Examine results
#---------------------------
kmeans_result$centers
# suggests that cluster one is clustered around sprinting ability,
# cluster two around climbing ability, and cluster three around endurance

library(ggplot2)   # for the plots below

# Create a scatter plot using ggplot
ggplot(df, aes(x = sprinting_ability, y = climbing_ability, color = factor(cluster))) +
  geom_point() +                                              # Add points
  labs(color = "Cluster", x = "Sprinting", y = "Climbing") +  # Labelling
  theme_minimal() +                                           # Minimal theme
  scale_color_brewer(palette = "Set1")                        # Use colour palette for better distinction

ggplot(df, aes(x = sprinting_ability, y = endurance, color = factor(cluster))) +
  geom_point() +                                              # Add points
  labs(color = "Cluster", x = "Sprinting", y = "Endurance") + # Labelling
  theme_minimal() +                                           # Minimal theme
  scale_color_brewer(palette = "Set1")                        # Use colour palette for better distinction

#---------------------------
# Step Six: Validate clusters
#---------------------------
# Load libraries
library(cluster)   # for silhouette
library(stats)     # for kmeans

# Calculating silhouette width
sil_width <- silhouette(kmeans_result$cluster, dist(clustering_data))

# Average silhouette width for the entire dataset
avg_sil_width <- mean(sil_width[, 'sil_width'])

# Print the average silhouette width
print(avg_sil_width)

# Plot the silhouette plot
plot(sil_width, col = 1:max(kmeans_result$cluster), border = NA)